Summary of work completed

Wrangling and data cleaning

  • Merged metadata sources from all field collections
  • Various cleaning and whatnot
  • Spatial clustering of points to assign “populations”
    • For now, all analyses are done at a 100km spatial clustering. This is very crude, but easy to understand and work with. Likely, we’ll want to change this to a closer buffer range (like 20km)
  • Merged clustered metadata with genotype data

COLONY/pedigree analysis

  • Prepared COLONY input data
  • Ran COLONY for each 100km cluster, separated by year (i.e. excluded “impossible” relationships, made COLONY run smooth that way!)
  • Wrangled COLONY outputs and merged it back to the genotype data
    • Although colonies are assigned at the 100km cluster, they can easily be examined at smaller spatial scales (e.g. filter dataset to only look at MN Zoo even though that’s clustered with Twin Cities generally. Then can examine # of unique colonies at the zoo, etc)
    • A quick check with the MN Zoo shows that this works (there are not families detected outside of expected distances)

Preliminary genetic analysis

  • Various quality checks and summary statistics
  • Preliminary versions of Fst, He, Ho, PAR
    • Beginning some of these as a function of pairwise distance, longitude, etc.
    • Again, doing this at the 100km cluster level which is crude but easy to communicate/aligns clearly with COLONY input

Outstanding issues and points of confusion


Next Steps


Preliminary Results

Data Availability and Map of collections

  • Total specimens in USDA database: 665
  • Total specimens with ANY genetic data in USDA database: 498
  • Specimens after basic filtering: 470
    • Had known lat-long, sex, and other basic metadata
    • Was from 2020 or 2021
    • Other minor filtering
  • Specimens used in COLONY: 308
    • Females only
    • Had at least 10 loci with data (vast majority have 13)
    • Was NOT from a known colony (this removes quite a few specimens)
  • Specimens in pop-gen analyses: 231
    • Same filters as COLONY and
    • Only one individual per detected colony
  • Number of males: 116

Figure 1. Map showing the collection locations of 470 specimens available after basic initial filtering. Colors represent “population cluster” as assigned by tree method (used hclust and cutree; tree length = 100km). Hover over the dots to see what cluster they were assigned to!

Quality checks and summary statistics

HW Equilibrium

See Shalene’s paper on page 998 for inspiration on how to deal with this…Jha et al. 2015 Contemporary human-altered landscapes and oceanic barriers reduce bumble bee gene flow

##               chi^2  df  Pr(chi^2 >) Pr.exact
## btern01    97.66759  36 1.343170e-07    0.000
## bt28       59.31202  10 4.889876e-09    0.000
## b96       105.11921  28 7.376899e-11    0.000
## bt30      113.92536  66 2.288810e-04    0.000
## btms0081   13.89628   3 3.049781e-03    0.004
## btms0066  421.14552 276 3.999226e-08    0.000
## btms0083 1010.34228 465 0.000000e+00    0.000
## b126      313.28115 120 0.000000e+00    0.000
## btms0062 1146.60599 561 0.000000e+00    0.000
## btern02   954.22326 300 0.000000e+00    0.000
## btms0086   50.77823   3 5.454315e-11    0.000
## bl13       58.50354   3 1.227130e-12    0.000
## btms0059  112.09174  36 9.687809e-10    0.001

There is evidence that, globally, loci are out of HWE. But this is due to differences between subpopulations. Shalene handles this nicely in the paper noted above, so it’s something to report…but not really anything to worry about downstream.

Linkage Disequilibrium

LD between loci - not likely an issue here (p.rD and p.Ia > 0.05; no consistent linkage in the pairwise comparison either).

##          Ia        p.Ia       rbarD        p.rD 
## 0.054012341 0.066000000 0.004580062 0.065000000

Null alleles

Null allele frequences are within the range typical for pop gen analysis with microsatellites and unlikely to be an issue. Can cite https://www.nature.com/articles/6800545 and write it like they do in https://peerj.com/articles/13565/ where they say, “Nevertheless, the frequency of null alleles inferred with the methods of Brookfield and Chakraborty (0.09 and 0.13, respectively) is in line with values commonly reported in the literature and is unlikely to cause a major bias in downstream population structure analyses (Dakin & Avise, 2004).”

##                       btern01       bt28        b96       bt30   btms0081
## Observed frequency 0.07366541 0.07133358 0.09233988 0.09316057 0.06801660
## Median frequency   0.07220547 0.06881528 0.09217080 0.09042780 0.06652242
## 2.5th percentile   0.03414292 0.03186540 0.04796458 0.05437336 0.02477958
## 97.5th percentile  0.11114992 0.11613681 0.13765518 0.13684859 0.11268061
##                      btms0066   btms0083       b126   btms0062   btern02
## Observed frequency 0.06941596 0.13619032 0.09291617 0.12542148 0.1480633
## Median frequency   0.06893741 0.13394953 0.09167348 0.12331952 0.1463794
## 2.5th percentile   0.03486506 0.09632793 0.05478723 0.08839799 0.1082095
## 97.5th percentile  0.10703151 0.17551051 0.13338109 0.16317304 0.1893019
##                      btms0086       bl13   btms0059
## Observed frequency 0.05457863 0.08153726 0.06513655
## Median frequency   0.05262534 0.08252742 0.06475930
## 2.5th percentile   0.01777369 0.03837588 0.02344977
## 97.5th percentile  0.09095684 0.13209085 0.10574397

Missing loci

There are no loci with <80% completeness

## named numeric(0)

Individuals with poor data

There are no individuals with poor data in the dataset…but that is because I also excluded them earlier in the process. So this is just a sanity check.

## named numeric(0)

Duplicates or clones?

There are no duplicates or clones in the dataset. It’s possible that there are prior to the COLONY filtering, so if re-running analysis keeping siblings in the dataset, double check.

## #############################
## # Number of Individuals:  231 
## # Number of MLG:  231 
## #############################
## [1] 231

Polymorphic

Just a little sanity check that all the loci are polymorphic (they are)

##    Mode    TRUE 
## logical      13

Preliminary Population Genetics Analysis

Summary Statistics

Table of various summary stats

From the summary call to the genind object, we find the number of specimens per cluster, the number of private alleles (which is not very informative as it’s correlated with sample size), and the rarefied allelic richness (which may be informative). It’s quite debatable how well the rarefied richness does when dealing with such uneven population sizes. This is a case where re-running with different clusters might be more effective (i.e. splitting apart the big ones and then dropping the small ones for this analysis)

Observed v. Expected Heterozygosity

Only displaying populations with 5 or more specimens.

Figure 2. Observed (blue) versus expected (grey) heterozygosity across the regions. Numbers above bars are the sample size for each region.

Same (except ALL clusters, not just those with >=5 specimens) as figure but in table format for investigating to your heart’s content. Added some math for a little helper.

F-Statistics

Pairwise Fst heat map. Not sure what’s up with the SE Minnesota population or Green Bay…but there’s also not that many specimens from them (8 and 7, respectively)

Below is with only populations having >=5 specimens

## 
## Call:
## lm(formula = Fst ~ distance, data = df_fst_join)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.032784 -0.014806 -0.004775  0.007285  0.093444 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.704e-02  4.725e-03   3.607 0.000686 ***
## distance    3.819e-08  9.292e-09   4.110 0.000138 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0225 on 53 degrees of freedom
## Multiple R-squared:  0.2417, Adjusted R-squared:  0.2274 
## F-statistic: 16.89 on 1 and 53 DF,  p-value: 0.0001381
## 
## Call:
## lm(formula = Fst ~ log(distance), data = df_fst_join)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.032783 -0.014908 -0.005903  0.005882  0.088357 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.178853   0.048422  -3.694 0.000524 ***
## log(distance)  0.016773   0.003846   4.361 5.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02217 on 53 degrees of freedom
## Multiple R-squared:  0.2641, Adjusted R-squared:  0.2502 
## F-statistic: 19.02 on 1 and 53 DF,  p-value: 5.994e-05

UPDATE!!!!!!!!!!!!!!!!!!! COLONY abundance and results

Insert summary table, case studies of MnZoo, Turtle Valley, and other sites with >15? 10? whatever

MN Zoo 2020

## # A tibble: 1 × 6
##   n_genotyped n_colonies ml.colony.num CI.lower CI.upper prop_detected
##         <int>      <int>         <int>    <dbl>    <dbl>         <dbl>
## 1          18          9            16        9       31         0.562

MN Zoo 2021

## # A tibble: 1 × 6
##   n_genotyped n_colonies ml.colony.num CI.lower CI.upper prop_detected
##         <int>      <int>         <int>    <dbl>    <dbl>         <dbl>
## 1          16         13            39     19.5     1000         0.333

Turtle Valley 2021

## # A tibble: 1 × 6
##   n_genotyped n_colonies ml.colony.num CI.lower CI.upper prop_detected
##         <int>      <int>         <int>    <dbl>    <dbl>         <dbl>
## 1          42         27            66     49.5     121.         0.409

UPDATE!!!!!!!!!!!!!!!!!!! Analysis of Males

Determine numbers of males using the minimum homozygosity method, discuss implications

## # A tibble: 4 × 3
##   sex    is_het     n
##   <chr>  <chr>  <int>
## 1 female no        20
## 2 female yes      327
## 3 male   no        52
## 4 male   yes       63
## [1] 0.1615385
## # A tibble: 4 × 3
##   sex    is_het     n
##   <chr>  <chr>  <int>
## 1 female no        29
## 2 female yes      318
## 3 male   no        89
## 4 male   yes       26
## [1] 0.0755814
## # A tibble: 4 × 3
##   site        threshold prop_diploid
##   <chr>           <dbl>        <dbl>
## 1 MN Zoo              1        0.175
## 2 MN Zoo              2        0.114
## 3 Appalachian         1        0.523
## 4 Appalachian         2        0.3

MNZoo 2020

## # A tibble: 3 × 3
##   sex    is_het     n
##   <chr>  <chr>  <int>
## 1 female yes       18
## 2 male   no         8
## 3 male   yes        1
## [1] 0.05263158
## # A tibble: 3 × 3
##   sex    is_het     n
##   <chr>  <chr>  <int>
## 1 female yes       18
## 2 male   no         8
## 3 male   yes        1
## [1] 0.05263158

mnzoo 2021

## # A tibble: 4 × 3
##   sex    is_het     n
##   <chr>  <chr>  <int>
## 1 female no         1
## 2 female yes       15
## 3 male   no         3
## 4 male   yes        6
## [1] 0.2857143
## # A tibble: 4 × 3
##   sex    is_het     n
##   <chr>  <chr>  <int>
## 1 female no         3
## 2 female yes       13
## 3 male   no         6
## 4 male   yes        3
## [1] 0.1875